Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python-package] Separately check whether pyarrow and cffi are installed #6785

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

mlondschien
Copy link
Contributor

The only tests I can think of would require a runner with pyarrow but not cffi installed.

Note that the LightGBMErrors will only raise when pyarrow is installed, but cffi is not. If pyarrow is not installed, pa_Table is a dummy class and isinstance(data, pa_Table) returns False.

This is a breaking change for users who didn't install lightgbm[arrow], but rather just installed lightgbm and pyarrow separately. Even if not intended, they could previously train a model on a pyarrow.Table, as this was converted via to a scipy.sparse.csr_matrix(data). The fix is simply to install cffi or to transform manually with scipy.sparse.csr_matrix.

Still, it is good to inform people that they are not "natively" training from a pyarrow.Table, incurring an unnecessary copy.

As already suggested in #6782, an alternative would be to raise a warning.

@jameslamb jameslamb changed the title Separately check whether pyarrow and cffi are installed [pyhton-package] Separately check whether pyarrow and cffi are installed Jan 12, 2025
@jameslamb jameslamb changed the title [pyhton-package] Separately check whether pyarrow and cffi are installed [python-package] Separately check whether pyarrow and cffi are installed Jan 12, 2025
@jameslamb jameslamb added the fix label Jan 12, 2025
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only tests I can think of would require a runner with pyarrow but not cffi installed.

Could you try adding tests that mock cffi not being available by mocking sys.modules? I tried that in an environment with scikit-learn (another optional dependency of lightgbm) installed and it seemed to work ok:

import sys
from unittest import mock

with mock.patch.dict(sys.modules, {'sklearn': None}):
    import lightgbm as lgb
    print(lgb.compat.SKLEARN_INSTALLED)
    # False

import lightgbm as lgb
print(lgb.compat.SKLEARN_INSTALLED)
# True

We don't have any examples of that in lightgbm's test suite, but I think it'd be interesting to try.

python-package/lightgbm/basic.py Outdated Show resolved Hide resolved
@mlondschien
Copy link
Contributor Author

It appears this does not work. I don't really understand why.

@mlondschien mlondschien requested a review from jameslamb January 16, 2025 09:23
@jameslamb jameslamb mentioned this pull request Jan 23, 2025
31 tasks
@mlondschien
Copy link
Contributor Author

@jameslamb How would you like to continue here?

@jameslamb
Copy link
Collaborator

I feel it'd be easy to accidentally undo this work in future refactorings. I will try to find a way to add a test covering this.

if not PYARROW_INSTALLED:
raise LightGBMError("Cannot init dataframe from Arrow without `pyarrow` installed.")
if not (PYARROW_INSTALLED and CFFI_INSTALLED):
raise LightGBMError("Cannot init dataframe from Arrow without `pyarrow` and `cffi` installed.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise LightGBMError("Cannot init dataframe from Arrow without `pyarrow` and `cffi` installed.")
raise LightGBMError("Cannot init Dataset from Arrow without `pyarrow` and `cffi` installed.")

This really should be Dataset, not dataframe... I'll make that change when I push testing changes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e72d5e2.

In that commit, I also removed backticks from these log messages, in favor of single quotes. Special characters in log messages can occasionally be problematic.

I know these things were already there before this PR, but might as well fix them right here while we're touching these lines.

def missing_module_cffi(monkeypatch):
"""Mock 'cffi' not being importable"""
monkeypatch.setattr(lightgbm.compat, "CFFI_INSTALLED", False)
monkeypatch.setattr(lightgbm.basic, "CFFI_INSTALLED", False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Came up with this based on https://docs.pytest.org/en/stable/reference/reference.html

I'm hoping that this could establish a pattern we re-use in other tests in future PRs.

It's not perfect (for example, if setting CFFI_INSTALLED is done incorrectly in compat.py, then this approach wouldn't catch that), but it's a lightweight and simple way to ensure we always cover code like the changes introduced in this PR.

Referenced https://docs.pytest.org/en/stable/reference/reference.html while working on this.

@jmoralez @borchero @StrikerRUS what do you think about this approach?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach looks quite fragile. I think more complicated but right approach would be like one from the following ones: https://stackoverflow.com/a/51048604.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a lot more complicated :/

I'll try it though, I do agree that it'd be a stronger test to go all the way into getting the import to literally raise an ImportError.

generate_dummy_arrow_table(),
label=pa.array([0, 1, 0, 0, 1]),
params=dummy_dataset_params(),
).construct()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.construct() is necessary here... __init_from_pyarrow_table() is not run as part of lgb.Dataset().

@jameslamb jameslamb dismissed their stale review January 26, 2025 21:03

I pushed changes here, my review shouldn't count towards a merge.

@jameslamb
Copy link
Collaborator

Ok, think I found a pattern that'll work for this testing! I just pushed e72d5e2, proposing that and adding some other small fixes.

Let me know if it looks ok to you @mlondschien .

I've also dismissed my review... now that I've made such significant edits here, my review shouldn't count towards a merge. @StrikerRUS @borchero @jmoralez could one of you help with a review?

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor suggestions below.

python-package/lightgbm/compat.py Outdated Show resolved Hide resolved
CFFI_INSTALLED = True
except ImportError:
CFFI_INSTALLED = False

class arrow_cffi: # type: ignore
"""Dummy class for pyarrow.cffi.ffi."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need

        CData = None
        addressof = None
        cast = None
        new = None

class members?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CData is needed for type hinting:

chunks: arrow_cffi.CData

But I think the others could be safely removed. It's only showing up in the diff in this PR because this code is being moved around... so this was missed in earlier PRs (I guess #6034).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed all but CData in 7396613

with pytest.raises(
lgb.basic.LightGBMError, match="Cannot predict from Arrow without 'pyarrow' and 'cffi' installed."
):
bst = lgb.train(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think lgb.train() part should be outside of the with block because we expect only predict() should fail.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are absolutely right, thank you!

Fixed in 7396613

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants